Conversation

@akshayutture-augment (Contributor)

* Dual writer: mode 3

* Add integration tests for playlists in mode 3

* Remove todo

* Update pkg/apiserver/rest/dualwriter_mode3.go

Co-authored-by: Arati R. <[email protected]>

* Admin: Fixes an issue where user accounts could not be enabled (#88117)

Fix: unable to enable user

* [REVIEW] Finish mode 3 and add tests

* Improve logging

* Update dependencies

* Update pkg/apiserver/rest/dualwriter_mode3_test.go

Co-authored-by: maicon <[email protected]>

* remove test assertion

* Use mode log when dual writer is initiated

---------

Co-authored-by: Arati R. <[email protected]>
Co-authored-by: gonvee <[email protected]>
Co-authored-by: maicon <[email protected]>
@akshayutture-augment (Contributor, Author)

@claude review this PR

@claude (bot) commented Nov 20, 2025

Claude finished @akshayutture-augment's task.


PR Review: Dual Storage Architecture - Mode 3 Implementation

I've completed my review of this PR. This implements Mode 3 of the dual writer pattern, which writes to both LegacyStorage and Storage (new unified store) but reads exclusively from Storage. Here are my findings:
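
For orientation, here is a minimal, self-contained sketch of the Mode 3 shape; the Store interface and mode3 struct below are simplified stand-ins for illustration only, not Grafana's actual LegacyStorage/Storage types:

package dualwriter

import "context"

// Store is a simplified stand-in for a storage backend, used only for illustration.
type Store interface {
    Create(ctx context.Context, obj any) (any, error)
    Get(ctx context.Context, name string) (any, error)
}

// mode3 writes to both backends but serves reads only from the unified store.
type mode3 struct {
    Legacy  Store
    Storage Store
}

func (d mode3) Create(ctx context.Context, obj any) (any, error) {
    created, err := d.Storage.Create(ctx, obj) // unified storage is authoritative
    if err != nil {
        return nil, err
    }
    go func() {
        // Fire-and-forget write to legacy storage; WithoutCancel keeps the write
        // alive after the request handler returns (error handling omitted here).
        _, _ = d.Legacy.Create(context.WithoutCancel(ctx), obj)
    }()
    return created, nil
}

func (d mode3) Get(ctx context.Context, name string) (any, error) {
    return d.Storage.Get(ctx, name) // reads never touch legacy storage
}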


🔍 Critical Issues

1. Inconsistent Metrics Recording in DeleteCollection (pkg/apiserver/rest/dualwriter_mode3.go:166)

_, err := d.Legacy.DeleteCollection(ctx, deleteValidation, options, listOptions)
d.recordStorageDuration(err != nil, mode3Str, options.Kind, method, startLegacy)  // ❌ Should be recordLegacyDuration

Issue: Line 166 calls recordStorageDuration instead of recordLegacyDuration for the legacy storage operation. This will corrupt metrics by attributing legacy operation latency to storage metrics.

Fix: Change to d.recordLegacyDuration(err != nil, mode3Str, options.Kind, method, startLegacy)

2. Wrong Metric Label in Delete Method (pkg/apiserver/rest/dualwriter_mode3.go:106)

d.recordStorageDuration(false, mode3Str, name, method, startStorage)  // ❌ Using 'name' instead of 'options.Kind'

Issue: Line 106 passes name (the object name) as the kind parameter instead of options.Kind. This breaks metric aggregation since each object name becomes a separate metric label value, potentially causing cardinality explosion.

Fix: Change to d.recordStorageDuration(false, mode3Str, options.Kind, method, startStorage)

3. Context Propagation Issue (Multiple locations)

In Create, Delete, Update, and DeleteCollection methods, the context is used in goroutines after being modified with a timeout:

go func() {
    ctx, cancel := context.WithTimeoutCause(ctx, time.Second*10, errors.New("legacy create timeout"))
    defer cancel()
    // ... uses ctx
}()

Issue: The goroutine derives its timeout context from the request-scoped ctx captured from the outer scope (and then shadows it). Because the request context is cancelled as soon as the handler returns or the client disconnects, the goroutine's context is cancelled with it, defeating the purpose of the 10-second timeout. This is a subtle but important bug.

Fix: Use a new context derived from context.Background() with relevant values copied, or use context.WithoutCancel(ctx) (Go 1.21+):

go func() {
    bgCtx := context.WithoutCancel(ctx)  // or context.Background()
    timeoutCtx, cancel := context.WithTimeoutCause(bgCtx, time.Second*10, errors.New("legacy create timeout"))
    defer cancel()
    // ... use timeoutCtx
}()

⚠️ Significant Issues

4. Potential Goroutine Leaks and Unrecovered Panics

All legacy operations are executed in goroutines without any mechanism to wait for completion or handle panics. If a legacy operation panics, it could crash the application.

Recommendation: Add recover() in goroutines:

go func() {
    defer func() {
        if r := recover(); r != nil {
            log.Error(fmt.Errorf("panic in legacy operation: %v", r), "legacy operation panicked")
        }
    }()
    // ... rest of the code
}()

5. Missing Error Logging Context in Delete Method (pkg/apiserver/rest/dualwriter_mode3.go:97)

ctx = klog.NewContext(ctx, d.Log)  // ❌ Should use 'log' variable

Issue: Line 97 creates context with d.Log instead of the enriched log variable that includes name, kind, and method. This means errors logged within Delete won't have this context.

Fix: Change to ctx = klog.NewContext(ctx, log)

6. Inconsistent Error Handling Between Methods

  • In Create (line 45), the error path records the failure with recordLegacyDuration(true, mode3Str, options.Kind, method, startStorage), even though it is Storage.Create that failed and the timer is startStorage.
  • Attributing a storage failure to the legacy-duration metric misrepresents both backends' latency and error rates.

Fix: Line 45 should call d.recordStorageDuration(true, mode3Str, options.Kind, method, startStorage) instead of recordLegacyDuration.


💡 Code Quality & Best Practices

7. Duplicate Registry Initialization Removed

The test file changes also improve coverage, which is excellent. In dualwriter_mode1_test.go:119, a duplicate line was removed:

p := prometheus.NewRegistry()  // Removed duplicate

Good cleanup, but verify that the p registry variable is still properly initialized for all tests that use it.

8. Test Coverage Improvements

The test file dualwriter_mode3_test.go went from commented-out tests to comprehensive unit tests. Great improvement! However:

  • Tests don't verify that legacy operations are actually called (no assertions on legacy mock calls)
  • Tests don't verify the asynchronous nature (goroutines complete correctly)
  • Consider adding tests for context cancellation scenarios

Recommendation: Add assertions to verify legacy storage is called:

// After dw.Create(...)
time.Sleep(20 * time.Millisecond)  // Give goroutine time to complete
m.AssertCalled(t, "Create", mock.Anything, mock.Anything, mock.Anything, mock.Anything)
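
A less flaky alternative is to poll instead of sleeping for a fixed interval; this sketch assumes testify's require and mock packages are imported and that m is the legacy storage mock (the throwaway *testing.T keeps the intermediate failed checks from failing the real test):

// Poll until the legacy mock has recorded the Create call, up to one second.
require.Eventually(t, func() bool {
    return m.AssertCalled(new(testing.T), "Create", mock.Anything, mock.Anything, mock.Anything, mock.Anything)
}, time.Second, 10*time.Millisecond, "legacy Create was never called")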

9. Integration Test Coverage

Excellent addition of Mode 3 integration tests in playlist_test.go. The tests cover:

  • File storage + Mode 3
  • Unified storage + Mode 3
  • etcd + Mode 3 (skipped but defined)

This is thorough and follows the existing pattern well.


📊 Performance Considerations

10. 10-Second Timeout May Be Too Long

All legacy operations have a 10-second timeout. For a system that's migrating away from legacy storage, this seems generous and could cause resource buildup if legacy storage is slow.

Recommendation: Consider making this configurable or using a shorter timeout (1-2 seconds) since legacy operations are asynchronous and non-blocking.
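
A minimal sketch of what a configurable timeout could look like; the Mode3Options type and its field are hypothetical, not part of the existing dual writer API (assumes only the standard time package):

// Hypothetical options type; the names here are illustrative only.
type Mode3Options struct {
    // LegacyWriteTimeout bounds each asynchronous legacy write; zero means "use the default".
    LegacyWriteTimeout time.Duration
}

// legacyWriteTimeout returns the configured timeout, falling back to a tighter
// default than the current hard-coded 10 seconds.
func (o *Mode3Options) legacyWriteTimeout() time.Duration {
    if o == nil || o.LegacyWriteTimeout <= 0 {
        return 2 * time.Second
    }
    return o.LegacyWriteTimeout
}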

11. No Backpressure Mechanism

If legacy storage is consistently slow or failing, goroutines will keep spawning without limit. Under heavy load, this could lead to resource exhaustion.

Recommendation: Consider using a worker pool or semaphore to limit concurrent legacy operations.
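
One lightweight way to add backpressure is a buffered-channel semaphore around the goroutine spawn; this is only a sketch, and the legacySem field does not exist in this PR (it would be something like make(chan struct{}, 32) on the dual writer, sized to the maximum number of in-flight legacy operations):

select {
case d.legacySem <- struct{}{}:
    go func() {
        // Release the slot when the legacy write finishes.
        defer func() { <-d.legacySem }()
        // ... perform the legacy write with its own timeout context here
    }()
default:
    // Semaphore is full: skip (or log and count) the legacy write rather than
    // spawning an unbounded number of goroutines.
}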


✅ Positive Aspects

  1. Excellent Logging: All operations now have structured logging with kind, name, method, and mode
  2. Metrics Instrumentation: Good use of metrics to track storage vs legacy performance
  3. Simplified Update Logic: The Update method is much cleaner now that the complex updateWrapper logic has been removed
  4. Async Legacy Operations: Fire-and-forget pattern is correct for Mode 3 where legacy is secondary
  5. Comprehensive Testing: Great test coverage including integration tests

📝 Summary

Must Fix Before Merge:

  1. ❌ Line 166: Wrong metric recording (recordStorageDuration → recordLegacyDuration)
  2. ❌ Line 106: Wrong label in metric (use options.Kind instead of name)
  3. ❌ Line 45: Wrong metric recording (recordLegacyDuration → recordStorageDuration)
  4. ❌ Line 97: Wrong context logging (use log instead of d.Log)
  5. ❌ Context propagation in all goroutines (use context.WithoutCancel)

Should Fix:

  • Add panic recovery in all goroutines
  • Add test assertions for legacy storage calls
  • Consider shorter timeout for legacy operations

Minor Improvements:

  • Make timeout configurable
  • Consider backpressure mechanism for legacy operations

Overall, this is a solid implementation of Mode 3, but the critical metric recording bugs and context propagation issue need to be fixed before merging.

